In previous lessons, you learned how to perform a range of wrangling
operations like filtering, mutating and summarizing. But so far, you
only performed these operations one column at a time. Sometimes
however, it will be useful (and efficient) to apply the same operation
to several columns at the same time. For this, the
across() function can be used.
Let’s see how!
Fig: the across() verb.
You can use across() with the mutate()
and summarize() verbs to apply operations over multiple
columns.
You can use the .names argument within
mutate(across()) to create new columns.
You can write anonymous (lambda) functions within
across()
This lesson will require the packages loaded below:
if(!require(pacman)) install.packages("pacman")
pacman::p_load(here, tidyverse)In this lesson, we will again use data from the COVID-19 serological survey conducted in Yaounde, Cameroon.
yaounde <- read_csv(here("data/yaounde_data.csv"))
yaounde <- yaounde %>% rename(age_years = age)
yaoundeWe will also use data from a hospital study conducted in Burkina Faso, in which a range of clinical data was collected from patients with febrile (fever-causing) diseases, with the aim of predicting the cause of the fever.
febrile_diseases <- read_csv(here("data/febrile_diseases_burkina_faso.csv"))
febrile_diseasesFinally, we will use data from a dietary diversity survey conducted in Vietnam, in which women were asked to recall (one or several days) the foods and drinks they consumed the previous day.
diet <- read_csv(here("data/vietnam_diet_diversity.csv"))
diet <- diet %>% rename(household_id = hhid)
dietacross() with mutate()The mutate() function gives you an easy way to create
new variables or modify in place variables.
But sometimes you have a large number of columns to operate on, and
typing out mutate() statements line-by-line can become
onerous. In such cases across() can radically simplify and
shorten your code.
Let’s see an example.
Consider the symptoms columns (from symp_fever to
symp_stomach_ache) in the yaounde data
frame:
yao_symptoms <-
yaounde %>%
select(age_years, sex, date_surveyed, symp_fever:symp_stomach_ache)
yao_symptoms The 13 columns between symp_fever and
symp_stomach_ache indicate whether or not each respondent
had a specific COVID-compatible symptom.
Now, imagine you wanted to convert all these columns to upper case.
(That is, “Yes” to “YES” and “No” to “NO”). How might you do this?
Without across(), you would have to mutate the columns one
by one, with the toupper() function:
yao_symptoms %>%
mutate(symp_fever = toupper(symp_fever),
symp_headache = toupper(symp_headache),
symp_cough = toupper(symp_cough),
symp_rhinitis = toupper(symp_rhinitis),
symp_sneezing = toupper(symp_sneezing),
symp_fatigue = toupper(symp_fatigue),
symp_muscle_pain = toupper(symp_muscle_pain)
#... And on and on and on and on and on
)This is obviously not very time-efficient. An experienced data analyst who saw this code might scold you for not obeying the DRY (“Don’t Repeat Yourself”) principle of programming.
But with the across() function, you have the power to do
this in all of two lines:
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = toupper))Amazing!
Let’s break down the code above. We used across() inside
of the mutate() function, and provided it with two main
arguments:
.cols defined the columns to be modified.
The symp_fever:symp_stomach_ache code means “all columns
between symp_fever and
symp_stomach_ache”.
.fns defined the functions to apply on the
selected columns. In this case, the toupper function was
applied.
And that’s the basic gist of across()! But below we’ll
consider each of these arguments in a bit more detail.
Why follow the DRY (Don’t Repeat Yourself) principle?
There are many reasons to avoid repetitive code. Here are just a few:
toupper
to tolower), you won’t need to make the same change in
several places. You can fix it in a single place..cols argumentNow let’s look at the arguments of across() in some more
detail.
As mentioned above, the .cols argument of
across() selects the columns to be modified.
Most the different methods you have learned for selecting columns can be used here.
One difference with the classical use of select() is
that to list column names with across(), you must wrap them
in c():
yao_symptoms %>%
mutate(across(.cols = c(symp_fever, symp_headache, symp_cough),
.fns = toupper))If, instead of c(symp_fever, symp_headache, symp_cough)
you just put in symp_fever, symp_headache, symp_cough,
you’ll get an error:
yao_symptoms %>%
mutate(across(.cols = symp_fever, symp_headache, symp_cough, # Don't do this
.fns = toupper))Error in `mutate()`:
! Problem while computing `..1 = across(.cols = symp_fever, symp_headache....
Other than that, the usual variable selection methods can be used here.
So you can use numeric ranges, like 4:16:
yao_symptoms %>%
mutate(across(.cols = 4:16,
.fns = toupper))Or helper verbs like starts_with():
yao_symptoms %>%
mutate(across(.cols = starts_with("symp_"),
.fns = toupper))Or the function where() to select columns of a
particular type:
yao_symptoms %>%
mutate(across(.cols = where(is.character),
.fns = toupper))Or the catch-all everything():
yao_symptoms %>%
mutate(across(.cols = everything(),
.fns = toupper))Note that everything() is the default value for the
.cols. So the above code, is equivalent to simply
running:
yao_symptoms %>%
mutate(across(.fns = toupper))In the febrile_diseases dataset, the columns from
abd_pain to splenomegaly indicate whether a
patient had a specified symptom, recorded as “yes” or “no”.
febrile_diseases %>%
select(abd_pain:splenomegaly)Use mutate() and across() to convert the
variable levels to uppercase. (That is, “yes” to “YES” and “no” to
“NO”)
Q_febrile_disease_symptoms <-
febrile_diseases %>%
____________________________.fns
argumentNow, on to the second argument in across(). As mentioned
above, this argument takes in the function to be applied across
columns.
You can provide any valid function here. We had previously used
toupper():
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = toupper))In a similar style, we can also use tolower():
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = tolower))Of course, the function we apply through across() needs
to be type-appropriate: it should apply to the type
(character, numeric, factor, etc) of the variables we are feeding
in.
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = log))Error in `mutate()`:
! non-numeric argument to mathematical function
Here we get an error message because we tried to apply a function made for numeric variables to character type variables.
It is a bit confusing to write a function without parentheses, as in
.fns = toupper. There is a difference between
toupper() and toupper.
toupper() calls the function, while toupper
without parenthesis makes a reference to the function. With a reference
to the function, across() will take care of calling it from
its back-end code (the code that defines across(). We call
it “back-end” because it’s “in the back” and you cannot see it unless
you go looking into it explicitly.)
In the febrile_diseases dataset, ensure that all the
columns from abd_pain to splenomegaly,
indicating symptoms of patients, are in lower case. Apply
tolower() across all these variables using
mutate() and across().
Q_febrile_disease_symptoms_to_lower <-
febrile_diseases %>%
____________________________Sometimes it is useful to use a custom function, called a “lambda function” or “anonymous function”. You will see more about functions in later lessons. The idea here is that you write your own operation which will be applied across your selected variables. The writing of these lambda functions has certain strict rules so pay attention to this as we go through several examples.
The toupper example we saw above can be rewritten with
this syntax:
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ toupper(.x)))In the code .fns = ~ toupper(.x), the tilda,
~, introduces the lambda function, and the .x
references each of the columns across which you are applying the
function. The .x takes the columns one by one and “calls”
the function on each one.
So overall, this code can be read as “apply toupper() to
each of the symptom variables.”
Here is another example, but with tolower():
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ tolower(.x)))The pattern is quite simple once you get used to it.
Now, with this anonymous function syntax, it becomes very intuitive to use functions that take in multiple arguments.
For example, we could explicit the “No” and “Yes” by pasting into the string what they are referring to, in other words, symptoms:
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ paste0(.x, " symptoms")))Or we could use str_sub(), a function that allows to
keep a subset of your string:
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ str_sub(.x, start = 1, end = 1)))In our case our string values are “No” and “Yes” so we will make a substring with just their first letters (“N” and “Y”) to have a one letter encoding.
Or we can recode the “Yes” and “No” entries in a different manner:
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ if_else(.x == "Yes", "Has symptom", "Does not have symptom")))Now we have the “Yes” encoded as “Has symptom” and the “No” encoded as “Does not have symptom”. These strings are longer but they are clearer in their meaning than just “Yes” vs. “No”.
We could also recode the “Yes” and “No” to numeric values:
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ if_else(.x == "Yes", 1, 2)))Now we have “Yes” encoded as choice 1 and “No” as 2.
Note that you can chain several mutate() calls
together:
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ if_else(.x == "Yes", 1, 2))) %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ case_when(.x == 1 ~ 1,
.x == 2 ~ 0)))Above, we first convert from “Yes” & “No” to the numeric values 1
and 2, to follow R indexing (R counts from 1 onwards). However, other
programming language start their index at 0, such as Python (Python
counts from 0 onwards). For many machine learning algorithms, your
encoding should be 0 or 1, so this could be a useful conversion of the
encoding of your data. We use case_when() to define which
numerical value should be switched to 1 (TRUE, has symptoms) and which
numerical value should be switched to 0 (FALSE, does not have
symptoms).
The columns from abd_pain to splenomegaly
in the febrile_diseases dataset contain information on
whether a patient had a specified symptom, recorded as “yes” or
“no”.
Use mutate(), across() and an anonymous
function to convert the variable levels to numbers, with “yes” as 1 and
“no” as 0.
Q_febrile_disease_symptoms_to_numeric <-
febrile_diseases %>%
____________________________In the diet dataset, the columns from
retinol to zinc give the number of milligrams
of each nutrient consumed by the surveyed women in a day.
diet %>%
select(retinol:zinc)Use mutate(), across() and an anonymous
function to convert these values to grams (divide by 1000).
Q_diet_to_grams <-
diet %>%
____________________________.names argumentThe examples we have seen so far all involved replacing existing columns.
But what if you want to create new columns instead?
To illustrate this, let’s create a smaller subset of
yao_symptoms:
yao_symptoms_mini <-
yao_symptoms %>%
select(symp_fever, symp_headache, symp_cough)
yao_symptoms_miniNow, to convert all columns to uppercase, we would usually run:
yao_symptoms_mini %>%
mutate(across(.fns = toupper))This code modifies existing columns in place
(Remember that the default argument for .cols is
everything(), so the above code modifies all columns in the
dataset)
Now, if we instead want to make new columns that are
uppercase, we can use the .names argument of
across()
yao_symptoms_mini %>%
mutate(across(.fns = toupper,
.names = "{.col}_uppercase")){.col} represents each of the old column names. The rest
of the string. “_uppercase” is pasted together with the old column
names. So the code "{.col}_uppercase" code can be read as
“for each column, convert to uppercase and name it by pasting the
existing lowercase column name with _uppercase.”
Of course, we can input any arbitrary string:
yao_symptoms_mini %>%
mutate(across(.fns = toupper,
.names = "{.col}_BIG_LETTERS"))If we want the text to come before the old column name, we can also do this:
yao_symptoms_mini %>%
mutate(across(.fns = toupper,
.names = "uppercase_{.col}"))More usefully, we can create a numeric version of these symptoms variables:
yao_symptoms_mini %>%
mutate(across(.fns = ~ if_else(.x == "Yes", 1, 0),
.names = "numeric_{.col}"))Now you will convert again the columns from abd_pain to
splenomegaly in the febrile_diseases dataset,
on patient symptoms, into numerical values. But, you will create new
columns named numeric_abd_pain to
numeric_splenomegaly using the .names argument
within across().
Q_febrile_disease_symptoms_to_numeric_new_variables <-
febrile_diseases %>%
____________________________across() with summarize()To get summary statistics over multiple variables it is often helpful
to use across().
Consider again the columns from retinol to
zinc in the diet dataset, which indicate the
number of milligrams of each nutrient consumed by surveyed Vietnamese
women in a day:
diet %>%
select(retinol:zinc)Imagine you wanted to find the average amount of each nutrient consumed. To do this the usual way, you would need to type:
diet %>%
summarize(mean_retinol = mean(retinol),
mean_alpha_catorene = mean(alpha_catorene),
mean_beta_catorene = mean(beta_catorene),
mean_vitamin_c = mean(vitamin_c),
mean_vitamin_b2 = mean(vitamin_b2)
# And on and on and on for 15 columns
)Of course this is not very efficient.
But with across(), this can be done in just two
lines:
diet %>%
summarize(across(.cols = retinol:zinc,
.fns = mean))And recall that one of the primary benefits of
summarize() is that it facilitates grouped summaries. Well,
we can still use those here!
diet %>%
group_by(age_group) %>%
summarize(across(.cols = retinol:zinc,
.fns = mean))Beautiful! So much information extracted so easily.
Here we grouped the data by age group, then across all the nutrient variables, we calculated their mean by age group. It can be read as: “for the 40-49 years old age group, the mean consumption of retinol is roughly of 1.343 micrograms, which seems lower than for other age groups.”
Let’s see another example.
The columns from is_drug_parac to
is_drug_other in the yaounde dataset indicate,
as 1 or 0, whether or not a survey respondent was treated with the named
drug:
yao_drugs <-
yaounde %>%
select(age_years, sex, date_surveyed, is_drug_parac:is_drug_other)
yao_drugsHow could we count the number of respondents who took each drug?
We can simply take the sum of each column selecting the columns
intelligently and using the sum() function:
yao_drugs %>%
summarize(across(.cols = starts_with("is_drug"),
.fns = sum))Oh no! we get all NAs!
We were smart and selected all our columns using
starts_with() but we forgot to consider that
sum() has na.rm set to FALSE by
default. We need to ensure the na.rm argument is set to
TRUE.
The best way to do this is with lambda/anonymous function syntax:
yao_drugs %>%
summarize(across(.cols = starts_with("is_drug"),
.fns = ~ sum(.x, na.rm = TRUE)))Again, we could also create a grouped summary:
yao_drugs %>%
group_by(sex) %>%
summarize(across(.cols = starts_with("is_drug"),
.fns = ~ sum(.x, na.rm = TRUE)))This last code chunk counts the number of individuals, per sex (group by sex), who have received each drug (summing the number of people across each drug variable).
A final example.
Recall that the 13 columns from symp_fever to
symp_stomach_ache in the yao_symptoms dataset
indicate whether or not each respondent had a specific COVID-compatible
symptom:
yao_symptomsHow would we count the number of people with each symptom using
across().
We have two options.
Option 1: We could first mutate() the “Yes” and “No” to
numeric values:
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ if_else(.x == "Yes", 1, 0)))And then use sum() within summarize():
yao_symptoms %>%
mutate(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ if_else(.x == "Yes", 1, 0))) %>%
summarize(across(.cols = symp_fever:symp_stomach_ache,
.fns = sum))Option 2: we could jump directly to summarize(), by
summing with a condition:
yao_symptoms %>%
summarize(across(.cols = symp_fever:symp_stomach_ache,
.fns = ~ sum(.x == "Yes")))This code can be read as: across each symptom column, sum all individuals who have been recorded as receiving that drug (who have a “Yes” data entry for that drug).
In the diet data set, the variables
fao_fgw1 to fao_fgw21 record the number of
calories consumed from different FAO food groups. (FAO stands for “Food
and Agricultural Organization”; the food groups are shown in Appendix
1.).
Use summarize() and across() to calculate
the mean amount of calories obtained from each good group.
Q_diet_FAO_mean <-
diet %>%
____________________________In the febrile_diseases data set, the columns from
abd_pain to splenomegaly in the
febrile_diseases dataset contain information on whether a
patient had a specified symptom, recorded as “yes” or “no”. Use
summarize(), across() to count the number of
people with each symptom.
Q_febrile_disease_symptoms_count <-
febrile_diseases %>%
____________________________In the yaounde data set, calculate the median for the
age, height, weight, number of bedridden days and numer of days off from
work (i.e. from the variable age_years to the variable
n_bedridden_days)
Use summarize() and across(), giving the
.fns argument a lambda function to calculate the median.
Careful ! A lambda function with the right arguments is indispensable,
else you will have an NA median for some of the
variables.
Q_yaounde_median <-
yaounde %>%
____________________________When we explored summarize() we rejoiced with the fact
that we could calculate multiple summary statistics at the same time.
This is also possible within across().
Coming back to the diet data survey from Vietnam, we could calculate both the mean and the median across all the nutrient variables:
diet %>%
summarise(across(.cols = retinol:zinc,
.fns = list(mean = mean,
median = median)))Here it is clear that on all numeric type variables of the data set,
we want to calculate the mean and the median. We can do so by providing
the .fns argument of across() with a list.
Small joke: .fns isn’t “functions” abbreviated plural
for nothing ! If we could only apply one function within
across(), it would have been named .fn
(function singular abbreviated).
This time, for the naming, across() takes care of naming
the resulting summary statistic columns. The syntax is :
list(desired_name_1 = function_1, desired_name_2 = function_2).
Let’s see how you could control the naming even more, when operating on a list of functions:
diet %>%
summarise(across(.cols = retinol:zinc,
.fns = list(average = mean, median = median),
.names = "{.fn}_{.col}"))Here we reference the name of the function using {.fn}
and the name of the column with {.col}. It is important to
note that both abbreviations are singular! They are
singular because they reference the function and the column one by one.
Within the across() procedure, across() takes
the functions and the columns one by one and for each one, takes the
function name, such as average, and the column name, such
as retinol, and makes the summary statistic
average_retinol
(i.e. {.fn}=average and
{.col}=retinol).
As we are discussing mean, median, standard deviation calculations,
we have to anticipate for NA values. Consider the code
below:
diet %>%
summarise(across(.cols = retinol:zinc,
.fns = list(average = ~ mean(.x, na.rm = TRUE),
median = ~ median(.x, na.rm = TRUE)),
.names = "{.fn}_{.col}"))Here we have the same code as above, except we ensure that none of
the means or medians will be NA by adding the
na.rm=TRUE argument to the functions. For this, as we have
seen above, we need to use the lambda/anonymous function style. Here we
are giving the .fns argument a list of lambda
functions.
In the diet data set, calculate the mean and the
standard deviation for kilocalories, water, carbohydrates, fat, and
protein consumed (i.e. from the variable
kilocalories_consumed to the variable
carbs_consumed_grams)
Use summarize() and across(), giving the
.fns argument a list of the desired summary statistics.
Make sure your means are named COLUMN_NAME_mean and your
standard deviations are named COLUMN_NAME_sd.
Q_diet_food_composition_mean_sd <-
diet %>%
____________________________In the febrile_diseases data set, calculate the mean and
the standard deviation for white blood cells,and all other blood
analysis measurements (i.e. from the variable wbc to the
variable relymp_a, seeing Appendix 2 for the detailed names
of the variable name abbreviations)
Use summarize() and across(), giving the
.fns argument a list of the desired summary statistics.
Careful ! You need to give a list of lambda functions to calculate the
mean and standard deviation, paying attention to the na.rm
arguments, else your summary statistics will be set to
NA.
Make sure your means are named COLUMN_NAME_mean and your
standard deviations are named COLUMN_NAME_sd.
Q_febrile_diseases_mean_blood_composition <-
febrile_diseases %>%
____________________________across() can be used inside many different {dplyr}
verbs:
mutate(across(multiple_columns, function(s) to apply))
summarize(across(multiple_columns, function(s) to apply))
The statement defining multiple columns can be:
a list of names
e.g. c(symp_fever, symp_headache, symp_cough)
a range of names e.g. retinol:zinc
a condition: !sex OR
where(is.numeric)
The function(s) to apply across all columns can be:
an existing function of R (such as as.factor,
mean etc.)
a custom (lambda/anonymous) function
a list of existing functions (such as
list(mean = mean, sd = sd))
a list of custom (lambda/anonymous) functions
This was your first approach to across(): congrats for
making it through ! Remember the power of combination of
across() and other verbs. If you feel a summarizing or
mutation operation is identical for more than one variable, then usually
you should think of using across().
In the upcoming lessons we will see some more data wrangling verbs: see you soon !
The following team members contributed to this lesson:
Some material in this lesson was adapted from the following sources:
Summarise each group to fewer rows. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/summarize.html
Create, modify, and delete columns — Mutate. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/mutate.html
Apply a function (or functions) across multiple columns — Across. (n.d.). Retrieved 21 February 2022, from https://dplyr.tidyverse.org/reference/across.html
Artwork was adapted from:
| code | meaning |
| fao_fgw1 | Consumed amount from Foods made from grains |
| fao_fgw2 | Consumed amount from White roots and tubers and plantain |
| fao_fgw3 | Consumed amount from Pulses |
| fao_fgw4 | Consumed amount from Nuts and seeds |
| fao_fgw5 | Consumed amount from Milk and milk products |
| fao_fgw6 | Consumed amount from Organ meat |
| fao_fgw7 | Consumed amount from Meat and poultry |
| fao_fgw8 | Consumed amount from Fish and seafood |
| fao_fgw9 | Consumed amount from Eggs |
| fao_fgw10 | Consumed amount from Dark green leafy vegetables |
| fao_fgw11 | Consumed amount from Vitamin A-rich vegetables, roots and tubers |
| fao_fgw12 | Consumed amount from Vitamin A-rich fruits |
| fao_fgw13 | Consumed amount from Other vegetables |
| fao_fgw14 | Consumed amount from Other fruits |
| fao_fgw15 | Consumed amount from Insects and other small protein foods |
| fao_fgw16 | Consumed amount from Other oils and fats |
| fao_fgw17 | Consumed amount from Savoury and fried snacks |
| fao_fgw18 | Consumed amount from Sweets |
| fao_fgw19 | Consumed amount from Sugar sweetened beverages |
| fao_fgw20 | Consumed amount from Condiments and seasonings |
| fao_fgw21 | Consumed amount from Other beverages and foods |
| Abbreviation | Complete Name |
|---|---|
| WBC | white bloodcell |
| RBC | red bloodcell |
| HGB | hemoglobin |
| PLT | platelet |
| NEUT_A | neutrophils |
| LYMP_A | lymphocytes |
| MONO_A | monocytes |
| EOSI_A | eosinophils |
| BASO_A | basophils |
| NRBC_A | nucleated red blood cells |
| IG_A | immature granulocytes |
| RET_A | reticulocytes |
| ASLYMP_A | antibody-synthesizing lymphocytes |
| RELYMP_A | reactive lymphocytes |